{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "\n",
    "api_key = os.getenv('MY_API_KEY')\n",
    "api_base = os.getenv('MY_API_BASE')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#langchain\n",
    "* prompt,prompt模板\n",
    "* 加载外部文件，怎么解决文件token数超出chatgpt limit限制\n",
    "* 怎么管理/使用外部工具\n",
    "* 集成向量数据库\n",
    "\n",
    "https://python.langchain.com/docs/get_started/introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "langchain就是设计了一套框架/API,来解决大模型应用开发当中碰到的问题，做到开箱即用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "直接进入代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install langchain==0.0.335"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts.chat import ChatPromptTemplate\n",
    "from langchain.chat_models.openai import ChatOpenAI\n",
    "\n",
    "template = \"You are a helpful assistant that translate {input_lang} to {output_lang}\"\n",
    "human_template = \"{text}\"\n",
    "\n",
    "chat_template = ChatPromptTemplate.from_messages([\n",
    "    (\"system\",template),\n",
    "    (\"human\",human_template)\n",
    "])\n",
    "chain = chat_template | ChatOpenAI(openai_api_base=api_base,openai_api_key=api_key)\n",
    "mag = chain.invoke({\"input_lang\":\"English\",\"output_lang\":\"Chinese\",\"text\":\"I love programming\"})\n",
    "print(mag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这段代码，主要涉及到两个组件，一个是llm模型，ChatOpenAI,一个是prompt template；“|”是langchain的表达式(LCEL ,langchain expresion language),表达式是langchian新版的特性，可以更加方便/灵活的组件执行链"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面一个一个模块开始介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model I/O\n",
    "一个经典的执行链基本上包括三个最小组成部分\n",
    "* prompt/输入管理\n",
    "* 模型预测推理\n",
    "* 输出的parser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## prompt管理，PromptTemplate\n",
    "主要有两类\n",
    "* PromptTemplate\n",
    "* ChatPromptTemplate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PromptTemplate\n",
    "主要用于非chat场景，无状态"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"Tell me a {adjective} joke about {content}\"\n",
    ")\n",
    "print(prompt_template.format(adjective=\"funny\",content=\"chikens\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ChatPromptTemplate\n",
    "主要用于chat类场景，有状态"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "chat_prompt_template = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\",\"You are a helpful assistant AI bot, Your name is {name}\"),\n",
    "        (\"human\",\"Hello, how are you doing?\"),\n",
    "        (\"ai\",\"I'm doing greate!\"),\n",
    "        (\"human\",\"{user_input}\"),\n",
    "    ]\n",
    ")\n",
    "print(chat_prompt_template.format_messages(name=\"Bob\",user_input=\"What's your name?\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate,HumanMessagePromptTemplate,SystemMessagePromptTemplate\n",
    "from langchain.schema.messages import SystemMessage,AIMessage,HumanMessage\n",
    "chat_prompt_template = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        SystemMessagePromptTemplate.from_template(\"You are a helpful assistant AI bot, Your name is {name}\"),\n",
    "        HumanMessage(content=\"Hello, how are you doing?\"),\n",
    "        AIMessage(content=\"I'm doing greate!\"),\n",
    "        HumanMessagePromptTemplate.from_template(\"{user_input}\")\n",
    "    ]\n",
    ")\n",
    "print(chat_prompt_template.format_messages(name=\"Bob\",user_input=\"What's your name?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然有些prompt是随时可变，有些prompt是初始化一次就好的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template = PromptTemplate.from_template(\"translate words to {lang}:{words}\")\n",
    "partial_template = prompt_template.partial(lang = \"Chinese\")\n",
    "\n",
    "print(partial_template.format(words=\"I lobe programming\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## FewShotChatMessagePromptTemplate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import FewShotChatMessagePromptTemplate\n",
    "\n",
    "exmaples = [\n",
    "    {\"input\":\"2+2\",\"output\":\"4\"},\n",
    "    {\"input\":\"2+6\",\"output\":\"8\"},\n",
    "    {\"input\":\"10+10\",\"output\":\"20\"},\n",
    "]\n",
    "\n",
    "example_template = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"human\",\"{input}\"),\n",
    "        (\"ai\",\"{output}\")\n",
    "    ]\n",
    ")\n",
    "\n",
    "few_shot_template = FewShotChatMessagePromptTemplate(\n",
    "    example_prompt=example_template,\n",
    "    examples=exmaples\n",
    ")\n",
    "\n",
    "final_template = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\",\"You're a wondrous wizard of math\"),\n",
    "        few_shot_template,\n",
    "        (\"human\",\"{input}\")\n",
    "    ]\n",
    ")\n",
    "print(final_template.format(input=\"99+99\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "展开来说，few-shop的目的是为了让大模型从上下文中学习，更好的完成任务；为了使用更好的例子，可以用example-selector,结合向量数据库一起使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install tiktoken"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.prompts import SemanticSimilarityExampleSelector\n",
    "from langchain.vectorstores.chroma import Chroma\n",
    "\n",
    "exmaples = [\n",
    "    {\"input\":\"2+2\",\"output\":\"4\"},\n",
    "    {\"input\":\"2+6\",\"output\":\"8\"},\n",
    "    {\"input\":\"10+10\",\"output\":\"20\"},\n",
    "    {\"input\":\"tell me a joke\",\"output\":\"haha\"}\n",
    "]\n",
    "\n",
    "# 将例子存入向量数据库\n",
    "to_vec= [\" \".join(example.values()) for example in exmaples]\n",
    "embeddings = OpenAIEmbeddings(openai_api_base=api_base,openai_api_key=api_key)\n",
    "vector_store = Chroma.from_texts(to_vec,embeddings,metadatas=exmaples)\n",
    "\n",
    "# 检索最优的例子\n",
    "exmaple_selector = SemanticSimilarityExampleSelector(\n",
    "    vectorstore=vector_store,\n",
    "    k = 1,\n",
    ")\n",
    "\n",
    "exmaple_selector.select_examples({\"input\":\"1+1\"})\n",
    "\n",
    "few_shot_template = FewShotChatMessagePromptTemplate(\n",
    "    input_variables=[\"input\"],\n",
    "    example_selector=exmaple_selector,\n",
    "    example_prompt=ChatPromptTemplate.from_messages(\n",
    "        [\n",
    "            (\"human\",\"{input}\"),\n",
    "            (\"ai\",\"{output}\")\n",
    "        ]\n",
    "    )\n",
    ")\n",
    "\n",
    "final_prompt = ChatPromptTemplate.from_messages([\n",
    "     (\"system\",\"You're a wondrous wizard of math\"),\n",
    "     few_shot_template,\n",
    "     (\"human\",\"{input}\")\n",
    "    ]\n",
    ")\n",
    "print(final_prompt.format(input=\"99+99\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，exmaple_selector是可以自定义的，准确来说所有组件都可以自定义；"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 大模型/LLM\n",
    "* LLM\n",
    "* ChatModel\n",
    "llm直接用于推理预测，chat接口，多role交互"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ChatModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chat = ChatOpenAI(api_key=api_key,api_base=api_base)\n",
    "\n",
    "messages = [\n",
    "    SystemMessage(content=\"You're a helpful assistant\"),\n",
    "    HumanMessage(content=\"tell me a joke\")\n",
    "]\n",
    "chat.invoke(messages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms.openai import OpenAI\n",
    "llm = OpenAI(api_base=api_base,api_key=api_key)\n",
    "llm.invoke(\"tell me a joke\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为资源是很贵的，所以缓存也是当然的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.cache import InMemoryCache\n",
    "from langchain.globals import set_llm_cache\n",
    "from langchain.callbacks import get_openai_callback\n",
    "\n",
    "# 设置缓存\n",
    "set_llm_cache(InMemoryCache())\n",
    "llm = OpenAI(api_base=api_base,api_key=api_key)\n",
    "\n",
    "with get_openai_callback() as cb:\n",
    "    print(llm.invoke(\"tell me a joke\"))\n",
    "    print(cb)\n",
    "\n",
    "with get_openai_callback() as cb:\n",
    "    print(llm.invoke(\"tell me a joke\"))\n",
    "    print(cb)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，缓存也可以从内存缓存数据库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.cache import SQLiteCache\n",
    "set_llm_cache(SQLiteCache(database_path=\".langchain.db\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也可以使用huggingface上的开源模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from langchain.llms.huggingface_hub import HuggingFaceHub\n",
    "\n",
    "# llm = HuggingFaceHub(repo_id=\"\",huggingfacehub_api_token=\"\")\n",
    "# llm.invoke()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# output parser\n",
    "就是模型输出的解析。开发大模型应用必备组件，通常在大模型进行格式化输出后，接着就可以执行业务逻辑，如使用外部外部工具等"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.output_parsers.json import SimpleJsonOutputParser\n",
    "\n",
    "json_prompt = PromptTemplate.from_template(\n",
    "    \"Return a JSON object with an 'answer' that answer the following question:{question}\"\n",
    ")\n",
    "json_parser = SimpleJsonOutputParser()\n",
    "json_chain = json_prompt | OpenAI(openai_api_base=api_base,openai_api_key=api_key) | json_parser\n",
    "json_chain.invoke({\"question\":\"what's hugging face\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " 自定义output parser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema import BaseOutputParser\n",
    "\n",
    "class CustomOutPutParser(BaseOutputParser):\n",
    "    \"\"\"parser llm output\"\"\"\n",
    "\n",
    "    def parse(self, text: str) -> str:\n",
    "        \"\"\"parse the output call\"\"\"\n",
    "        return text.strip().split(\" \")\n",
    "    \n",
    "json_prompt = PromptTemplate.from_template(\n",
    "    \"Return a JSON object with an 'answer' that answer the following question:{question}\"\n",
    ")\n",
    "custom_parser = CustomOutPutParser()\n",
    "json_chain = json_prompt | OpenAI(openai_api_base=api_base,openai_api_key=api_key) | custom_parser\n",
    "json_chain.invoke({\"question\":\"what's hugging face\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这一期讲了langchain执行链最基本的三个组件\n",
    "* promptTemplate\n",
    "* llm/chat\n",
    "* ouputparser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Retrieval\n",
    "RAG(Retrival argument generate)检索增强生成。\n",
    "* 文本导入\n",
    "    * token过长怎么办？\n",
    "    * 对文本怎么切分处理？\n",
    "* 外部工具\n",
    "* 向量数据库\n",
    "    * Chroma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文档加载"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install pypdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## langchain提供了多种文档加载器\n",
    "* csv\n",
    "* file directory\n",
    "* html\n",
    "* json\n",
    "* markdown\n",
    "* pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import CSVLoader,DirectoryLoader,UnstructuredHTMLLoader,JSONLoader,UnstructuredMarkdownLoader,PyPDFLoader,TextLoader\n",
    "\n",
    "# csv\n",
    "#loader = CSVLoader(file_path=\"\")\n",
    "#data = loader.load()\n",
    "\n",
    "# file directory\n",
    "#loader = DirectoryLoader('',glob=\".txt\",loader_cls=TextLoader)\n",
    "\n",
    "# html\n",
    "#loader = UnstructuredHTMLLoader(file_path=\"\")\n",
    "\n",
    "# json\n",
    "#loader = JSONLoader(file_path=\"\")\n",
    "\n",
    "# markdown\n",
    "#loader = UnstructuredMarkdownLoader(file_path=\"\")\n",
    "\n",
    "# pdf_loader\n",
    "loader = PyPDFLoader(file_path=\"./transformer.pdf\")\n",
    "data = loader.load()\n",
    "\n",
    "print(data)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Document transformers文本转换/分割\n",
    "大模型token限制，需要分割文档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install tiktoken==0.3.3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "tiktoken是openai开发的分词器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "\n",
    "loader = PyPDFLoader(file_path=\"./transformer.pdf\")\n",
    "\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size = 100,chunk_overlap=10,separator = \" \"\n",
    ")\n",
    "pages = loader.load_and_split(text_slitter)\n",
    "print(len(pages))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "怎么对一个大文本进行摘要总结呢？如果不适用langchain已有的组件，最简单的方式就是便利每一个文本块总结。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts.prompt import PromptTemplate\n",
    "from langchain.llms.openai import OpenAI\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "import time\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"Here's your first summary:{pre_response}\"\n",
    "    \"Now add to it based on the following context:{context}\"\n",
    ")\n",
    "\n",
    "llm = OpenAI(api_key=api_key,api_base=api_base)\n",
    "\n",
    "def dfs(docs,cur_doc_index,summary):\n",
    "    if cur_doc_index == len(docs):\n",
    "        print(f\"final summary: {summary}\")\n",
    "        return\n",
    "    \n",
    "    document = docs[cur_doc_index]\n",
    "    prompt_str = prompt_template.format_prompt(pre_response = summary,context=document.page_content)\n",
    "    summary = llm.invoke(prompt_str)\n",
    "    time.sleep(3)\n",
    "    print(f\"each summary {cur_doc_index}: {summary}\")\n",
    "    dfs(docs,cur_doc_index + 1,summary)\n",
    "\n",
    "loader = PyPDFLoader(file_path=\"./transformer.pdf\")\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size = 200,chunk_overlap=10,separator = \" \"\n",
    ")\n",
    "pages = loader.load_and_split(text_slitter)\n",
    "dfs(pages[2:5],0,\"\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "langchain提供了文本总结的组件，分别有\n",
    "* Stuff\n",
    "* Refine\n",
    "* Map Reduce\n",
    "* Map Re-Rank"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stuff\n",
    "stuff就是暴力总结，将文本一股脑传给大模型，不管有没有超出limit限制\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Stuff\n",
    "from langchain.prompts import PromptTemplate\n",
    "from langchain.schema.prompt_template import format_document\n",
    "from langchain.schema import Document\n",
    "\n",
    "text = \"\"\"Nuclear power in space is the use of nuclear power in outer space, typically either small fission systems or radioactive decay for electricity or heat. Another use is for scientific observation, as in a Mössbauer spectrometer. The most common type is a radioisotope thermoelectric generator, which has been used on many space probes and on crewed lunar missions. Small fission reactors for Earth observation satellites, such as the TOPAZ nuclear reactor, have also been flown.[1] A radioisotope heater unit is powered by radioactive decay and can keep components from becoming too cold to function, potentially over a span of decades.[2]\n",
    "\n",
    "The United States tested the SNAP-10A nuclear reactor in space for 43 days in 1965,[3] with the next test of a nuclear reactor power system intended for space use occurring on 13 September 2012 with the Demonstration Using Flattop Fission (DUFF) test of the Kilopower reactor.[4]\n",
    "\n",
    "After a ground-based test of the experimental 1965 Romashka reactor, which used uranium and direct thermoelectric conversion to electricity,[5] the USSR sent about 40 nuclear-electric satellites into space, mostly powered by the BES-5 reactor. The more powerful TOPAZ-II reactor produced 10 kilowatts of electricity.[3]\n",
    "\n",
    "Examples of concepts that use nuclear power for space propulsion systems include the nuclear electric rocket (nuclear powered ion thruster(s)), the radioisotope rocket, and radioisotope electric propulsion (REP).[6] One of the more explored concepts is the nuclear thermal rocket, which was ground tested in the NERVA program. Nuclear pulse propulsion was the subject of Project Orion.[7]\n",
    "\n",
    "Regulation and hazard prevention[edit]\n",
    "After the ban of nuclear weapons in space by the Outer Space Treaty in 1967, nuclear power has been discussed at least since 1972 as a sensitive issue by states.[8] Particularly its potential hazards to Earth's environment and thus also humans has prompted states to adopt in the U.N. General Assembly the Principles Relevant to the Use of Nuclear Power Sources in Outer Space (1992), particularly introducing safety principles for launches and to manage their traffic.[8]\n",
    "\n",
    "Benefits\n",
    "\n",
    "Both the Viking 1 and Viking 2 landers used RTGs for power on the surface of Mars. (Viking launch vehicle pictured)\n",
    "While solar power is much more commonly used, nuclear power can offer advantages in some areas. Solar cells, although efficient, can only supply energy to spacecraft in orbits where the solar flux is sufficiently high, such as low Earth orbit and interplanetary destinations close enough to the Sun. Unlike solar cells, nuclear power systems function independently of sunlight, which is necessary for deep space exploration. Nuclear-based systems can have less mass than solar cells of equivalent power, allowing more compact spacecraft that are easier to orient and direct in space. In the case of crewed spaceflight, nuclear power concepts that can power both life support and propulsion systems may reduce both cost and flight time.[9]\n",
    "\n",
    "Selected applications and/or technologies for space include:\n",
    "\n",
    "Radioisotope thermoelectric generator\n",
    "Radioisotope heater unit\n",
    "Radioisotope piezoelectric generator\n",
    "Radioisotope rocket\n",
    "Nuclear thermal rocket\n",
    "Nuclear pulse propulsion\n",
    "Nuclear electric rocket\n",
    "\"\"\"\n",
    "doc_prompt = PromptTemplate.from_template(\"{page_content}\")\n",
    "docs = [Document(page_content=split) for split in text.split(\"\\n\")]\n",
    "chain = {\"content\" : lambda docs: \"\\n\".join(format_document(doc,doc_prompt) for doc in docs)} | PromptTemplate.from_template(\"Summarize the following content:\\n{content}\")\n",
    "print(chain.invoke(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Refine\n",
    "这个其实就和例子中的递归本质上一样的，先总结第一个文本，然后带这上一个总结以及下一个文本合并总结，以此递归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from functools import partial\n",
    "from operator import itemgetter\n",
    "from langchain.schema import StrOutputParser\n",
    "\n",
    "text = \"\"\"Nuclear power in space is the use of nuclear power in outer space, typically either small fission systems or radioactive decay for electricity or heat. Another use is for scientific observation, as in a Mössbauer spectrometer. The most common type is a radioisotope thermoelectric generator, which has been used on many space probes and on crewed lunar missions. Small fission reactors for Earth observation satellites, such as the TOPAZ nuclear reactor, have also been flown.[1] A radioisotope heater unit is powered by radioactive decay and can keep components from becoming too cold to function, potentially over a span of decades.[2]\n",
    "\n",
    "The United States tested the SNAP-10A nuclear reactor in space for 43 days in 1965,[3] with the next test of a nuclear reactor power system intended for space use occurring on 13 September 2012 with the Demonstration Using Flattop Fission (DUFF) test of the Kilopower reactor.[4]\n",
    "\n",
    "After a ground-based test of the experimental 1965 Romashka reactor, which used uranium and direct thermoelectric conversion to electricity,[5] the USSR sent about 40 nuclear-electric satellites into space, mostly powered by the BES-5 reactor. The more powerful TOPAZ-II reactor produced 10 kilowatts of electricity.[3]\n",
    "\n",
    "Examples of concepts that use nuclear power for space propulsion systems include the nuclear electric rocket (nuclear powered ion thruster(s)), the radioisotope rocket, and radioisotope electric propulsion (REP).[6] One of the more explored concepts is the nuclear thermal rocket, which was ground tested in the NERVA program. Nuclear pulse propulsion was the subject of Project Orion.[7]\n",
    "\n",
    "Regulation and hazard prevention[edit]\n",
    "After the ban of nuclear weapons in space by the Outer Space Treaty in 1967, nuclear power has been discussed at least since 1972 as a sensitive issue by states.[8] Particularly its potential hazards to Earth's environment and thus also humans has prompted states to adopt in the U.N. General Assembly the Principles Relevant to the Use of Nuclear Power Sources in Outer Space (1992), particularly introducing safety principles for launches and to manage their traffic.[8]\n",
    "\n",
    "Benefits\n",
    "\n",
    "Both the Viking 1 and Viking 2 landers used RTGs for power on the surface of Mars. (Viking launch vehicle pictured)\n",
    "While solar power is much more commonly used, nuclear power can offer advantages in some areas. Solar cells, although efficient, can only supply energy to spacecraft in orbits where the solar flux is sufficiently high, such as low Earth orbit and interplanetary destinations close enough to the Sun. Unlike solar cells, nuclear power systems function independently of sunlight, which is necessary for deep space exploration. Nuclear-based systems can have less mass than solar cells of equivalent power, allowing more compact spacecraft that are easier to orient and direct in space. In the case of crewed spaceflight, nuclear power concepts that can power both life support and propulsion systems may reduce both cost and flight time.[9]\n",
    "\n",
    "Selected applications and/or technologies for space include:\n",
    "\n",
    "Radioisotope thermoelectric generator\n",
    "Radioisotope heater unit\n",
    "Radioisotope piezoelectric generator\n",
    "Radioisotope rocket\n",
    "Nuclear thermal rocket\n",
    "Nuclear pulse propulsion\n",
    "Nuclear electric rocket\n",
    "\"\"\"\n",
    "\n",
    "# 对第一个文本总结\n",
    "first_prompt = PromptTemplate.from_template(\"Summarize this content:\\n{context}\")\n",
    "\n",
    "# 读取doc\n",
    "doc_prompt = PromptTemplate.from_template(\"{page_content}\")\n",
    "partial_format_doc = partial(format_document,prompt = doc_prompt)\n",
    "\n",
    "#执行链\n",
    "summary_chain = {\"context\":partial_format_doc} | first_prompt | llm\n",
    "\n",
    "# 循环所有页\n",
    "refine_template = PromptTemplate.from_template(\n",
    "    \"Here's your first summary:{pre_response}\"\n",
    "    \"Now add to it based on the following context:{context}\"\n",
    ")\n",
    "\n",
    "# 执行链\n",
    "refine_chain = {\"pre_response\":itemgetter(\"pre_response\"),\"context\":lambda x:partial_format_doc(x['doc'])} | refine_template | llm\n",
    "\n",
    "def loop(docs):\n",
    "    summary = summary_chain.invoke(docs[0])\n",
    "    # 轮询剩下的\n",
    "    for i , doc in enumerate(docs[1:]):\n",
    "        summary = refine_chain.invoke({\"pre_response\":summary,\"doc\":doc})\n",
    "        print(summary)\n",
    "    return summary\n",
    "\n",
    "docs = [Document(page_content=split) for split in text.split(\"\\n\")]\n",
    "print(loop(docs[:3]))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Map Reduce\n",
    "分为两个阶段，首先是map阶段，对每一个文档进行总结，reduce阶段将之前的所有总结合并\n",
    "## Map Re-Rank\n",
    "这种主要用在问答场景\n",
    "map阶段对每一个文本进行打分，将分数最高的文本作为提示词进行作答"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有更加方便，开箱即用的API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains.summarize import load_summarize_chain\n",
    "\n",
    "loader = PyPDFLoader(\"./transformer.pdf\")\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100,chunk_overlap=0,separator=\" \")\n",
    "docs = loader.load_and_split(text_slitter)\n",
    "\n",
    "chain = load_summarize_chain(llm,chain_type=\"refine\",verbose=True)\n",
    "\n",
    "chain.run(docs[:3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# embedding & vector store & retriver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "\n",
    "embeddings_model = OpenAIEmbeddings(api_key=api_key,api_base=api_base)\n",
    "\n",
    "embeddings = embeddings_model.embed_documents(\n",
    "    [\n",
    "        \"Hi,Dog name is haha\",\n",
    "        \"Cat name's hehe\",\n",
    "        \"My name is xixi\"\n",
    "    ]\n",
    ")\n",
    "embedding_query = embeddings_model.embed_query(\"what's your name?\")\n",
    "embedding_query[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "embeddings_model同样也支持cache"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.storage import InMemoryStore\n",
    "from langchain.embeddings import CacheBackedEmbeddings\n",
    "store = InMemoryStore()\n",
    "\n",
    "embeddings_model = OpenAIEmbeddings(api_key=api_key,api_base=api_base)\n",
    "\n",
    "embedder = CacheBackedEmbeddings.from_bytes_store(embeddings_model,store)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores.chroma import Chroma\n",
    "\n",
    "pdf_loader = PyPDFLoader(\"./transformer.pdf\")\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100,chunk_overlap=0,separator=\" \")\n",
    "pdf_docs = pdf_loader.load_and_split(text_slitter)\n",
    "db = Chroma.from_documents(documents=pdf_docs[:5],embedding=OpenAIEmbeddings(api_key=api_key,api_base=api_base))\n",
    "query = \"what's transformer?\"\n",
    "result= db.similarity_search(query)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后是检索API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores.chroma import Chroma\n",
    "\n",
    "pdf_loader = PyPDFLoader(\"./transformer.pdf\")\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100,chunk_overlap=0,separator=\" \")\n",
    "pdf_docs = pdf_loader.load_and_split(text_slitter)\n",
    "db = Chroma.from_documents(documents=pdf_docs[:5],embedding=OpenAIEmbeddings(api_key=api_key,api_base=api_base))\n",
    "\n",
    "retriever = db.as_retriever()\n",
    "\n",
    "print(retriever.invoke(\"what's transformer?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，结合rag+llm做一个demo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Hi,Dog name is haha\\nCat name's hehe\\nMy name is xixi\\n\"\n",
    "text_slitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100,chunk_overlap=0,separator=\" \")\n",
    "texts = text_slitter.split_text(text)\n",
    "\n",
    "em = OpenAIEmbeddings(api_key=api_key,api_base=api_base)\n",
    "db = Chroma.from_texts(texts,em)\n",
    "retriever = db.as_retriever()\n",
    "\n",
    "retriever_docs = retriever.invoke(\"what's your name?\")\n",
    "\n",
    "template = \"Answer the question based only on the following context:\\n{context}\\nQuestion:{question}\"\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "model = ChatOpenAI(api_key=api_key,api_base=api_base)\n",
    "\n",
    "chain = {\"context\":retriever_docs} | prompt | model | StrOutputParser()\n",
    "\n",
    "chain.invoke(\"name?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然langchain也提供了多种的检索器，最后介绍一种比较有用的，WebResearchRetriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.retrievers.web_research import WebResearchRetriever\n",
    "from langchain.utilities.bing_search import BingSearchAPIWrapper\n",
    "from langchain.chains import RetrievalQAWithSourcesChain\n",
    "\n",
    "vector_store= Chroma(embedding_function=OpenAIEmbeddings(api_key=api_key,base_url=api_base),persist_directory=\"./chrome_db\")\n",
    "\n",
    "llm = ChatOpenAI(api_key=api_key,base_url=api_base)\n",
    "\n",
    "search = BingSearchAPIWrapper(bing_search_url=\"https://api.bing.microsoft.com/v7.0/search\",bing_subscription_key=os.getenv('BING_API_KEY'))\n",
    "\n",
    "wrr = WebResearchRetriever.from_llm(vectorstore=vector_store,llm=llm,search=search)\n",
    "\n",
    "user_input = \"How do LLM Agent work?\"\n",
    "qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=wrr)\n",
    "result = qa_chain.invoke({\"question\":user_input})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# next\n",
    "* chain的原理，以及tools user\n",
    "* agent模式，原理"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
